Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary file size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.
The goal of loop unwinding is to increase a program's speed by reducing or eliminating instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration; reducing branch penalties; as well as hiding latencies, including the delay in reading data from memory. To eliminate this computational overhead, loops can be re-written as a repeated sequence of similar independent statements.
Loop unrolling is also part of certain formal verification techniques, in particular bounded model checking. Model Checking Using SMT and Theory of Lists
Optimizing compilers will sometimes perform the unrolling automatically, or upon request.
for (int x = 0; x < 100; x++) {
remove(x);
}
| for (int x = 0; x < 100; x += 5) {
remove(x);
remove(x + 1);
remove(x + 2);
remove(x + 3);
remove(x + 4);
}
|
As a result of this modification, the new program has to make only 20 iterations, instead of 100. Afterwards, only 20% of the jumps and conditional branches need to be taken, and represents, over many iterations, a potentially significant decrease in the loop administration overhead. To produce the optimal benefit, no variables should be specified in the unrolled code that require pointer arithmetic. This usually requires "Base address plus offset" addressing, rather than indexed referencing.
On the other hand, this manual loop unrolling expands the source code size from 3 lines to 7, that have to be produced, checked, and debugged, and the compiler may have to allocate more registers to store variables in the expanded loop iteration . In addition, the loop control variables and number of operations inside the unrolled loop structure have to be chosen carefully so that the result is indeed the same as in the original code (assuming this is a later optimization on already working code). For example, consider the implications if the iteration count were not divisible by 5. The manual amendments required also become somewhat more complicated if the test conditions are variables. See also Duff's device.
for i := 1:8 do
if i mod 2 = 0 then do_even_stuff(i)
else do_odd_stuff(i);
next i;
| do_odd_stuff(1); do_even_stuff(2);
do_odd_stuff(3); do_even_stuff(4);
do_odd_stuff(5); do_even_stuff(6);
do_odd_stuff(7); do_even_stuff(8);
|
But of course, the code performed need not be the invocation of a procedure, and this next example involves the index variable in computation:
x(1) := 1;
for i := 2:9 do
x(i) := x(i - 1) * i;
print i, x(i);
next i;
| x(1) := 1;
x(2) := x(1) * 2; print 2, x(2);
x(3) := x(2) * 3; print 3, x(3);
x(4) := x(3) * 4; print 4, x(4);
... etc.
|
which, if compiled, might produce a lot of code ( print statements being notorious) but further optimization is possible. This example makes reference only to x(i) and x(i - 1) in the loop (the latter only to develop the new value x(i)) therefore, given that there is no later reference to the array x developed here, its usages could be replaced by a simple variable. Such a change would however mean a simple variable whose value is changed whereas if staying with the array, the compiler's analysis might note that the array's values are constant, each derived from a previous constant, and therefore carries forward the constant values so that the code becomes
print 2, 2;
print 3, 6;
print 4, 24;
...etc.
In general, the content of a loop might be large, involving intricate array indexing. These cases are probably best left to optimizing compilers to unroll. Replicating innermost loops might allow many possible optimisations yet yield only a small gain unless n is large.
WHILE (condition) DO
action
ENDWHILE
.
.
.
.
.
.
| WHILE (condition) DO
action
IF NOT(condition) THEN GOTO break
action
IF NOT(condition) THEN GOTO break
action
ENDWHILE
LABEL break:
.
| IF (condition) THEN
REPEAT
action
IF NOT(condition) THEN GOTO break
action
IF NOT(condition) THEN GOTO break
action
WHILE (condition)
LABEL break:
|
Even better, the "tweaked" pseudocode example, that may be performed automatically by some optimizing compilers, eliminating unconditional jumps altogether.
Assembly language programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient . Here, the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded).
LM R15,R2,INIT Set R15 = maximum number of MVC
SR R15,R0 Subtract the remaining number of
BNP ALL If R15 is not positive, meaning we
MH R15,=AL2(ILEN) Multiply R15 by the length of one
B ALL(R15) Jump to ALL+R15, the address of the
MVC 14*256(100,R2),14*256(R1) Move 100 bytes of 15th entry.
MVC 13*256(100,R2),13*256(R1) Move 100 bytes of 14th entry.
MVC 12*256(100,R2),12*256(R1) Move 100 bytes of 13th entry.
MVC 11*256(100,R2),11*256(R1) Move 100 bytes of 12th entry.
MVC 10*256(100,R2),10*256(R1) Move 100 bytes of 11th entry.
MVC 09*256(100,R2),09*256(R1) Move 100 bytes of 10th entry.
MVC 08*256(100,R2),08*256(R1) Move 100 bytes of 9th entry.
MVC 07*256(100,R2),07*256(R1) Move 100 bytes of 8th entry.
MVC 06*256(100,R2),06*256(R1) Move 100 bytes of 7th entry.
MVC 05*256(100,R2),05*256(R1) Move 100 bytes of 6th entry.
MVC 04*256(100,R2),04*256(R1) Move 100 bytes of 5th entry.
MVC 03*256(100,R2),03*256(R1) Move 100 bytes of 4th entry.
MVC 02*256(100,R2),02*256(R1) Move 100 bytes of 3rd entry.
MVC 01*256(100,R2),01*256(R1) Move 100 bytes of 2nd entry.
MVC 00*256(100,R2),00*256(R1) Move 100 bytes of 1st entry.
S R0,MAXM1 Reduce the number of remaining entries
BNPR R14 If no more entries to process, return
AH R1,=AL2(16*256) Increment 'FROM' array pointer beyond
AH R2,=AL2(16*256) Increment 'TO' array pointer beyond
L R15,MAXM1 Reload the maximum number of MVC
B LOOP Execute loop again.
DC A(FROM) Address of start of array 1
DC A(TO) Address of start of array 2
In this example, approximately 202 instructions would be required with a "conventional" loop (50 iterations), whereas the above dynamic code would require only about 89 instructions (or a saving of approximately 56%). If the array had consisted of only two entries, it would still execute in approximately the same time as the original unwound loop. The increase in binary file size is only about 108 bytes even if there are thousands of entries in the array.
Similar techniques can of course be used where multiple instructions are involved, as long as the combined instruction length is adjusted accordingly. For example, in this same example, if it is required to clear the rest of each array entry to nulls immediately after the 100 byte field copied, an additional clear instruction, XC xx*256+100(156,R1),xx*256+100(R2), can be added immediately after every MVC in the sequence (where xx matches the value in the MVC above it).
It is, of course, perfectly possible to generate the above code "inline" using a single assembler macro statement, specifying just four or five operands (or alternatively, make it into a library subroutine, accessed by a simple call, passing a list of parameters), making the optimization readily accessible.
// The number of entries processed per loop iteration. // Note that this number is a 'constant constant' reflecting the code below. constexpr int BUNCHSIZE = 8
int main(void) {
int i = 0; // counter
int entries = 50; // total number to process
// If the number of elements is not divisible by BUNCHSIZE,
// get repeat times required to do most processing in the while loop
int repeat = (entries / BUNCHSIZE); // number of times to repeat
int left = (entries % BUNCHSIZE); // calculate remainder
// Unroll the loop in 'bunches' of 8
while (repeat--) {
printf("process(%d)\n", i);
printf("process(%d)\n", i + 1);
printf("process(%d)\n", i + 2);
printf("process(%d)\n", i + 3);
printf("process(%d)\n", i + 4);
printf("process(%d)\n", i + 5);
printf("process(%d)\n", i + 6);
printf("process(%d)\n", i + 7);
// update the index by amount processed in one go
i += BUNCHSIZE;
}
// Use a switch statement to process remaining by jumping to the case label
// at the label that will then drop through to complete the set
switch (left) {
case 7:
printf("process(%d)\n", i + 6); // process and rely on drop through
case 6:
printf("process(%d)\n", i + 5);
case 5:
printf("process(%d)\n", i + 4);
case 4:
printf("process(%d)\n", i + 3);
case 3:
printf("process(%d)\n", i + 2);
case 2:
printf("process(%d)\n", i + 1); // two left
case 1:
printf("process(%d)\n", i); // just one left to process
case 0:
break; // none left
}
}
The following example will compute a dot product of two 100-entry vectors A and B of type double. Here is the code in C: dotProduct += A[i] * B[i];
}
loop3:
l.d $f10, 0($5) ; $f10 ← A[i]
l.d $f12, 0($6) ; $f12 ← B[i]
mul.d $f10, $f10, $f12 ; $f10 ← A[i]*B[i]
add.d $f8, $f8, $f10 ; $f8 ← $f8 + A[i]*B[i]
addi $5, $5, 8 ; increment pointer for A[i] by the size
; of a double.
addi $6, $6, 8 ; increment pointer for B[i] by the size
; of a double.
addi $7, $7, -1 ; decrement loop count
test:
bgtz $7, loop3 ; Continue if loop count > 0
loop3:
l.d $f10, 0($5) ; iteration with displacement 0
l.d $f12, 0($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 8($5) ; iteration with displacement 8
l.d $f12, 8($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 16($5) ; iteration with displacement 16
l.d $f12, 16($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
l.d $f10, 24($5) ; iteration with displacement 24
l.d $f12, 24($6)
mul.d $f10, $f10, $f12
add.d $f8, $f8, $f10
addi $5, $5, 32
addi $6, $6, 32
addi $7, $7, -4
test:
bgtz $7, loop3 ; Continue loop if $7 > 0
|
|